Custom Hierarchies #1432

Open: wants to merge 27 commits into base: dev

@franzpoeschel franzpoeschel commented May 3, 2023

The openPMD standard works by defining "what must be there", but does not impose restrictions as to "what must not be there". By this principle, openPMD is an extensible standard.
So far, standard extensions relied mostly on defining additional metadata in terms of attributes, e.g. for storing the name of the employed field solver for the ED-PIC extension. Custom hierarchies and custom n-dimensional datasets ("heavy" data in comparison to lightweight metadata) have not been employed so far, even though the openPMD standard permits them in principle. The major hindrance to such data organization has been the lack of support at the level of the openPMD-api, i.e. the implementation of the standard.

As the first part of this PR, the openPMD-api now supports writing custom-defined hierarchies and datasets within the basepath, i.e. within Iterations. This change is entirely independent from the standard as it makes use of the already existing liberty within the standard's conception as explained in the introduction.

This alone finds useful applications already:

  • Data that has been marked up according to another standard can be embedded side-by-side with openPMD-formatted particle-mesh data. A short example is given as part of this PR that writes an openPMD-formatted temperature mesh side by side with a simple NeXus example. The resulting dataset is shown below:
      string       /basePath                                                attr   = "/data/%T/"
      string       /date                                                    attr   = "2024-08-12 16:58:01 +0200"
      string       /iterationEncoding                                       attr   = "groupBased"
      string       /iterationFormat                                         attr   = "/data/%T/"
      string       /meshesPath                                              attr   = "meshes/"
      string       /openPMD                                                 attr   = "1.1.0"
      uint32_t     /openPMDextension                                        attr   = 0
      string       /software                                                attr   = "openPMD-api"
      string       /softwareVersion                                         attr   = "0.16.0-dev"
      double       /data/100/dt                                             attr   = 1
      double       /data/100/time                                           attr   = 0
      double       /data/100/timeUnitSI                                     attr   = 1
      string       /data/100/Scan/NX_class                                  attr   = "NXentry"
      string       /data/100/Scan/data/NX_class                             attr   = "NXdata"
      string       /data/100/Scan/data/axes                                 attr   = {"two_theta"}
      int64_t      /data/100/Scan/data/counts                               {15} = 0 / 0
      string       /data/100/Scan/data/counts/long_name                     attr   = "photodiode counts"
      string       /data/100/Scan/data/counts/units                         attr   = "counts"
      string       /data/100/Scan/data/signal                               attr   = "counts"
      double       /data/100/Scan/data/two_theta                            {15} = 0 / 0
      string       /data/100/Scan/data/two_theta/long_name                  attr   = "two_theta (degrees)"
      string       /data/100/Scan/data/two_theta/units                      attr   = "degrees"
      uint8_t      /data/100/Scan/data/two_theta_indices                    attr   = {0}
      string       /data/100/Scan/default                                   attr   = "data"
      double       /data/100/meshes/temperature                             {5, 5} = 0 / 0
      string       /data/100/meshes/temperature/axisLabels                  attr   = {"x", "y"}
      string       /data/100/meshes/temperature/dataOrder                   attr   = "C"
      string       /data/100/meshes/temperature/geometry                    attr   = "cartesian"
      double       /data/100/meshes/temperature/gridGlobalOffset            attr   = {0, 0}
      double       /data/100/meshes/temperature/gridSpacing                 attr   = {1, 1}
      double       /data/100/meshes/temperature/gridUnitSI                  attr   = 1
      long double  /data/100/meshes/temperature/position                    attr   = {0.5, 0.5}
      float        /data/100/meshes/temperature/timeOffset                  attr   = 0
      double       /data/100/meshes/temperature/unitDimension               attr   = {0, 0, 1, 0, 0, 0, 0}
      double       /data/100/meshes/temperature/unitSI                      attr   = 1
    
  • Embedding non-physical information into output files. An example is the particle-in-cell simulation PIConGPU, which uses openPMD for regular output as well as for checkpoint-restart output. In the case of checkpoint-restart, internal program state must be serialized along with the physical state of the simulation. This is currently only possible by pretending that the internal state is a mesh, which confuses many post-processing tools such as visualizers. PIConGPU has been adapted to make use of this change on this Git tree, check here for a diff. A shortened example output is pasted below, demonstrating that internal state information is now cleanly separated from physical data:
      float     /data/100/fields/E/x                                      {192, 1024, 192}
      float     /data/100/fields/E/y                                      {192, 1024, 192}
      float     /data/100/fields/E/z                                      {192, 1024, 192}
      float     /data/100/particles/e/momentum/x                          {71958528}
      float     /data/100/particles/e/momentum/y                          {71958528}
      float     /data/100/particles/e/momentum/z                          {71958528}
      float     /data/100/particles/e/position/x                          {71958528}
      float     /data/100/particles/e/position/y                          {71958528}
      float     /data/100/particles/e/position/z                          {71958528}
      int32_t   /data/100/particles/e/positionOffset/x                    {71958528}
      int32_t   /data/100/particles/e/positionOffset/y                    {71958528}
      int32_t   /data/100/particles/e/positionOffset/z                    {71958528}
      float     /data/100/particles/e/weighting                           {71958528}
      char      /data/100/picongpu_internal/RNG/RNGProvider3XorMin        {48, 128, 147456}
      uint64_t  /data/100/picongpu_internal/idProvider/nextId             {1, 1, 1}
      uint64_t  /data/100/picongpu_internal/idProvider/startId            {1, 1, 1}
    

Building on top of this, the other logical component of this PR consists in the support of this standard extension. While the PR as described so far brings custom hierarchies and datasets to the openPMD-api in a way that is transparent to the standard itself, the purpose of this next standard extension is to now make the standard aware of these hierarchies by embedding openPMD markup within them.

The schematic idea behind this is pictured below:
[Image: schematic of openPMD markup embedded within a custom hierarchy]

With this, the data organization can step back into openPMD markup from anywhere within a custom-defined hierarchy. This further extends the use of this PR to:

  • Using openPMD markup within another standard, rather than merely beside it. This is currently being applied exploratively in this script for a sample dataset collected in the POLARIS laboratory.
  • For more complex setups, this permits a better organization of output data. As an example, meshes can be of different kinds such as 3-dimensional physical fields or 2-dimensional images; also there might be similar kinds of dependencies between particle data. It is desirable to group such data in a way that reflects the logical adjacencies and interdependencies between them.
  • A particular instance of the above is mesh refinement, currently proposed in a standard extension as a suffix-based naming scheme. This comment details a more natural and more easily parsed approach to mesh refinement that is based on custom hierarchies instead. A mesh-refined dataset of this type might be structured as follows:
    /data/0/refined_mesh_levels/0/meshes/E
    /data/0/refined_mesh_levels/0/meshes/B
    /data/0/refined_mesh_levels/1/meshes/E
    /data/0/refined_mesh_levels/1/meshes/B
    /data/0/refined_mesh_levels/2/meshes/E
    /data/0/refined_mesh_levels/2/meshes/B
    +++++++ ––––––––––––––––––––– ++++++++
    standard        custom        standard
    
    /data/0/simulation_internal/some_checkpointing_info
    +++++++ –––––––––––––––––––––––––––––––––––––––––––
    standard                  custom
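
To make the "standard / custom / standard" split above concrete, here is a minimal, self-contained sketch that takes such a path apart. It does not use the openPMD-api; `MeshLocation`, `splitPath` and `locateMesh` are hypothetical names for illustration, and the sketch assumes the base path `/data/%T/` together with the shorthand meshes path `meshes/`.

```cpp
#include <cstddef>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper type: the three logical parts of a mesh path.
struct MeshLocation
{
    std::string iteration;   // "%T" in /data/%T/ (standard)
    std::string customGroup; // custom hierarchy between iteration and meshes/
    std::string meshName;    // name of the mesh below meshes/ (standard)
};

// Split a path at '/' and drop empty components.
std::vector<std::string> splitPath(std::string const &path)
{
    std::vector<std::string> parts;
    std::istringstream stream(path);
    std::string part;
    while (std::getline(stream, part, '/'))
    {
        if (!part.empty())
        {
            parts.push_back(part);
        }
    }
    return parts;
}

// Returns the location of a mesh, or nothing if the path does not end in
// ".../meshes/<name>" below "/data/<iteration>/".
std::optional<MeshLocation> locateMesh(std::string const &path)
{
    auto const parts = splitPath(path);
    // Required at least: data, <iteration>, meshes, <mesh name>
    if (parts.size() < 4 || parts[0] != "data")
    {
        return std::nullopt;
    }
    if (parts[parts.size() - 2] != "meshes")
    {
        return std::nullopt;
    }
    MeshLocation result;
    result.iteration = parts[1];
    result.meshName = parts.back();
    // Everything between the iteration and "meshes" is custom hierarchy.
    for (std::size_t i = 2; i + 2 < parts.size(); ++i)
    {
        if (!result.customGroup.empty())
        {
            result.customGroup += '/';
        }
        result.customGroup += parts[i];
    }
    return result;
}
```

With this, `locateMesh("/data/0/refined_mesh_levels/1/meshes/E")` yields the custom group `refined_mesh_levels/1` and the mesh name `E`, while `/data/0/simulation_internal/some_checkpointing_info` is not recognized as a mesh at all.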
    

TODO

  • Merge first: Remove necessity for RecordComponent::SCALAR #1154
  • Await Pybind11 release that has merged this fix: Introduce recursive_container_traits pybind/pybind11#4623
  • Implement custom groups at the Iteration level that can hold custom attributes
  • Implement custom datasets inside custom hierarchy
  • Implement openPMD-defined meshes/particles-data from anywhere in the hierarchy
  • Implement extended meshesPath/particlesPath
  • Update the openPMD standard, see Allow user to store non-openPMD information openPMD-standard#115 (comment)
  • Lenient parsing in CustomHierarchy class
  • Maybe lazy parsing of the custom hierarchy?
  • Use the new SharedAttributableData pattern to better implement variable-based encoding (where series.iterations and series.iterations[0] are the same backend objects)
  • Replace Iteration::meshes with Iteration::mesh("subdir/E") and Iteration::allMeshes() -> std::map<std::string, Mesh>, similar Iteration::species("subdir/e") and Iteration::allSpecies() -> std::map<std::string, ParticleSpecies>. But should it be species("subdir/particles/e") or species("subdir/e")?
  • Generalize to Attributable::openAsCustomHierarchy()?

Diff: https://github.com/franzpoeschel/openPMD-api/compare/topic-remove-scalar-component..topic-custom-hierarchies


private:
template <typename... Arg>
iterator makeIterator(Arg &&...arg)

Check notice

Code scanning / CodeQL

Unused local variable Note

Variable arg is not used.
return iterator{this, std::forward<Arg>(arg)...};
}
template <typename... Arg>
const_iterator makeIterator(Arg &&...arg) const

Check notice

Code scanning / CodeQL

Unused local variable Note

Variable arg is not used.
REQUIRE(r["x"].resetDataset(dset).numAttributes() == 0); /* unitSI */
// REQUIRE(r["y"].unitSI() == 1);
REQUIRE(r["y"].resetDataset(dset).numAttributes() == 0); /* unitSI */
// REQUIRE(r["z"].unitSI() == 1);

Check notice

Code scanning / CodeQL

Commented-out code Note test

This comment appears to contain commented-out code.
// unitSI is set upon flushing
// REQUIRE(r["x"].unitSI() == 1);
REQUIRE(r["x"].resetDataset(dset).numAttributes() == 0); /* unitSI */
// REQUIRE(r["y"].unitSI() == 1);

Check notice

Code scanning / CodeQL

Commented-out code Note test

This comment appears to contain commented-out code.
@@ -966,6 +968,27 @@
#endif
}

TEST_CASE("baserecord_test", "[core]")

Check notice

Code scanning / CodeQL

Unused static function Note test

Static function C_A_T_C_H_T_E_S_T_36 is unreachable (autoRegistrar37 must be removed at the same time)
Comment on lines 874 to 893
// for (auto it = this->container().begin(); it != end; ++it)
// {
// if (it->first == RecordComponent::SCALAR)
// {
// this->container().erase(it);
// throw error::WrongAPIUsage(detail::NO_SCALAR_INSERT);
// }
// }

Check notice

Code scanning / CodeQL

Commented-out code Note

This comment appears to contain commented-out code.
Comment on lines 855 to 874
// for (auto it = this->container().begin(); it != end; ++it)
// {
// if (it->first == RecordComponent::SCALAR)
// {
// this->container().erase(it);
// throw error::WrongAPIUsage(detail::NO_SCALAR_INSERT);
// }
// }

Check notice

Code scanning / CodeQL

Commented-out code Note

This comment appears to contain commented-out code.
@@ -1353,3 +1378,44 @@
UniquePtrWithLambda<int[]> arrptrFilledCustom{
new int[5]{}, [](int const *p) { delete[] p; }};
}

TEST_CASE("scalar_and_vector", "[core]")

Check notice

Code scanning / CodeQL

Unused static function Note test

Static function C_A_T_C_H_T_E_S_T_60 is unreachable (autoRegistrar61 must be removed at the same time)
@@ -156,6 +159,39 @@
}
}

TEST_CASE("custom_hierarchies", "[core]")

Check notice

Code scanning / CodeQL

Unused static function Note test

Static function C_A_T_C_H_T_E_S_T_4 is unreachable (autoRegistrar5 must be removed at the same time)
@franzpoeschel franzpoeschel force-pushed the topic-custom-hierarchies branch 2 times, most recently from c8a68a5 to 6c87958 Compare May 11, 2023 09:19
@franzpoeschel franzpoeschel force-pushed the topic-custom-hierarchies branch 2 times, most recently from 86d8a73 to 399e6cd Compare May 30, 2023 12:43
@@ -156,6 +159,129 @@
}
}

TEST_CASE("custom_hierarchies", "[core]")

Check warning

Code scanning / CodeQL

Poorly documented large function Warning test

Poorly documented function: fewer than 2% comments for a function of 194 lines.

franzpoeschel commented Jun 19, 2023

comment removed, updated version in comments below

@franzpoeschel franzpoeschel force-pushed the topic-custom-hierarchies branch 2 times, most recently from 8c28fab to 605bd55 Compare June 29, 2023 11:11
test/CoreTest.cpp Fixed

franzpoeschel commented Jul 13, 2023

For the meshesPath (equivalently for particlesPath), I have now implemented a prototype that does the following:

A path /data/0/custom/group/meshes/E is a mesh if the meshesPath contains any of the following:

  1. Full path to the group containing the mesh: /custom/group/meshes/
  2. Full path to the mesh itself: /custom/group/meshes/E (no longer supported)
  3. Shorthand notation: meshes/

The underlying rule: Full paths are denoted by a leading slash and are based on the data path (/data/%T)

Remark: The shorthand notation achieves backwards compatibility with old openPMD files
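
As a rough illustration of the matching rule above, the following self-contained sketch decides whether a group is a meshes group. It is not the prototype's actual code: `withTrailingSlash`, `lastComponent` and `isMeshesGroup` are hypothetical names, group paths are assumed to be given relative to the data path (`/data/%T`), and shorthand entries are assumed to consist of a single path component.

```cpp
#include <string>
#include <vector>

// Normalize a path so that it ends with exactly one '/'.
std::string withTrailingSlash(std::string path)
{
    while (!path.empty() && path.back() == '/')
    {
        path.pop_back();
    }
    path.push_back('/');
    return path;
}

// Last component of a normalized path, including the trailing slash,
// e.g. "/custom/group/meshes/" -> "meshes/".
std::string lastComponent(std::string const &normalized)
{
    if (normalized.size() < 2)
    {
        return "";
    }
    auto const pos = normalized.rfind('/', normalized.size() - 2);
    return pos == std::string::npos ? normalized
                                    : normalized.substr(pos + 1);
}

// groupPath: path of a group relative to /data/%T, with a leading slash.
// meshesPaths: the entries of the meshesPath attribute.
bool isMeshesGroup(
    std::string const &groupPath,
    std::vector<std::string> const &meshesPaths)
{
    std::string const group = withTrailingSlash(groupPath);
    for (auto const &entry : meshesPaths)
    {
        if (entry.empty())
        {
            continue;
        }
        std::string const candidate = withTrailingSlash(entry);
        if (entry.front() == '/')
        {
            // Full path: must match the whole group path.
            if (candidate == group)
            {
                return true;
            }
        }
        else if (candidate == lastComponent(group))
        {
            // Shorthand: only the name of the group itself must match.
            return true;
        }
    }
    return false;
}
```

Under these assumptions, the mesh `/data/0/custom/group/meshes/E` is recognized if its parent group `/custom/group/meshes/` passes `isMeshesGroup`, which is the case for both `{"/custom/group/meshes/"}` and `{"meshes/"}`.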


franzpoeschel commented Jul 13, 2023

One nontrivial design question is how to deal with the traditional openPMD hierarchy, especially with the paths /data/%T/meshes and /data/%T/particles. The openPMD standard defines no form of physical data for those groups themselves; a normal openPMD file contains no attributes /data/%T/meshes/<attr_name>.

This suggests to me that in the extended openPMD standard with custom hierarchies these paths should be treated as "nothing special". Rather, they become the canonical, but not mandatory layout/organization of a simple openPMD dataset.

Two somewhat tricky consequences from this point of view:

1. There might be more than one meshes path in the same group
E.g. the paths /data/%T/meshes and /data/%T/images might exist side by side. In the openPMD standard, this is no problem; in the openPMD-api, it becomes challenging.
The problem is with the member Iteration::meshes (made even worse by the fact that it's not a getter method, but a data member). Should it point to /data/%T/meshes? To a union of both? What about writing?

Imo, the best solution is to consider Iteration::meshes a shorthand API that should not be used in more complex setups. Rather, since /data/%T/meshes is now just another normal path in the custom Iteration hierarchy, one should access iteration["meshes"].asContainerOf<Mesh>() for clarity.

Iteration::meshes will point to the first user-specified meshes path that takes the form of a shorthand notation. E.g., after series.setMeshesPath({"fields/"}), the call iteration.meshes will be the same as iteration["fields"].asContainerOf<Mesh>(). This ensures backwards compatibility.

(Note: Since Iteration::meshes is unfortunately a member and not a method, this means that the meshes path must be set before creating or opening any Iteration. And it was enough fighting with pointers to get things to that state.)
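
The selection rule for Iteration::meshes described above can be sketched in a few lines. This is only an illustration without the openPMD-api: `shorthandMeshesPath` is a hypothetical name, and what happens when no shorthand entry exists is left open here (the sketch returns an empty optional).

```cpp
#include <optional>
#include <string>
#include <vector>

// Returns the first meshes path given in shorthand notation, i.e. the first
// entry without a leading slash. Full paths (leading slash) are skipped.
std::optional<std::string>
shorthandMeshesPath(std::vector<std::string> const &meshesPaths)
{
    for (auto const &entry : meshesPaths)
    {
        if (!entry.empty() && entry.front() != '/')
        {
            return entry;
        }
    }
    return std::nullopt;
}
```

For the entries `{"/custom/group/meshes/", "fields/", "images/"}`, this picks `fields/`, so iteration.meshes would refer to the same group as iteration["fields"].asContainerOf<Mesh>().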

2. There might be custom data inside /data/%T/meshes
This is not really a problem, but could be unexpected. When setting series.setMeshesPath({"/meshes/E"}), you state that only the E field is a mesh. Since /data/%T/meshes is otherwise "just a regular group" with no special meaning, there might be other data in there, too, e.g. /data/%T/meshes/custom/hierarchy. It's the job of the user to create a meaningful data layout here.

With the more restricted definition of meshesPath and particlesPath, this is no longer supported.

src/CustomHierarchy.cpp Fixed
@franzpoeschel franzpoeschel force-pushed the topic-custom-hierarchies branch 2 times, most recently from 53f968c to ba10099 Compare August 1, 2023 13:37
}
}

TEST_CASE("custom_hierarchies_no_rw", "[core]")

Check notice

Code scanning / CodeQL

Unused static function Note test

Static function C_A_T_C_H_T_E_S_T_6 is unreachable (autoRegistrar7 must be removed at the same time)
@franzpoeschel franzpoeschel mentioned this pull request Aug 1, 2023
1 task
@franzpoeschel franzpoeschel force-pushed the topic-custom-hierarchies branch 4 times, most recently from 31c7a25 to 1d47d17 Compare August 3, 2023 09:25

pgrete commented Sep 2, 2024

What's the status/schedule here?
I'm asking as we probably need this for a similar use case (storing additional data for checkpoint files), or I need a different hint with regard to our use case.
More specifically, I want to store some parameters (scalars and vectors of int, float, double, ...) that may belong to different "packages".
Our current paradigm (in another output format) has been to store them in attributes called Params/PACKAGE_NAME/PARAM_NAME.
As far as I understand, Params and PACKAGE_NAME could be a "group" within the standard (which is logically consistent with our data model).
Writing these with the openPMD-api also works (I see the data in the output files).
However, reading does not work:

Reading 'Params/tracers/t_lookback' with type: St6vectorIdSaIdEE
terminate called after throwing an instance of 'openPMD::error::NoSuchAttribute'
  what():  Params/tracers/t_lookback

I assume that this is because the attributes in "groups" are not parsed by default.
Here's what openPMD sees (printing iteration->attributes()):

Contains attribute: BlocksPerPE
Contains attribute: BoundaryConditions
Contains attribute: Coordinates
Contains attribute: IncludesGhost
Contains attribute: InputFile
Contains attribute: MaxLevel
Contains attribute: MeshBlockSize
Contains attribute: Multilevel
Contains attribute: NBDel
Contains attribute: NBNew
Contains attribute: NCycle
Contains attribute: NGhost
Contains attribute: NumDims
Contains attribute: NumMeshBlocks
Contains attribute: Refine
Contains attribute: RootGridDomain
Contains attribute: RootGridSize
Contains attribute: RootLevel
Contains attribute: WallTime
Contains attribute: derefinement_count
Contains attribute: dt
Contains attribute: loc.level-gid-lid-cnghost-gflag
Contains attribute: loc.lx123
Contains attribute: time
Contains attribute: timeUnitSI

and here's what's in the file

$ bpls ../parthenon.opmd.00002.bp -A 
  string    /author                                                                attr
  string    /basePath                                                              attr
  uint8_t   /bla                                                                   attr
  string    /comment                                                               attr
  int32_t   /data/2/BlocksPerPE                                                    attr
  string    /data/2/BoundaryConditions                                             attr
  string    /data/2/Coordinates                                                    attr
  int32_t   /data/2/IncludesGhost                                                  attr
  string    /data/2/InputFile                                                      attr
  int32_t   /data/2/MaxLevel                                                       attr
  int32_t   /data/2/MeshBlockSize                                                  attr
  int32_t   /data/2/Multilevel                                                     attr
  int32_t   /data/2/NBDel                                                          attr
  int32_t   /data/2/NBNew                                                          attr
  int32_t   /data/2/NCycle                                                         attr
  int32_t   /data/2/NGhost                                                         attr
  int32_t   /data/2/NumDims                                                        attr
  int32_t   /data/2/NumMeshBlocks                                                  attr
  double    /data/2/Params/Hydro/AdiabaticIndex                                    attr
  uint8_t   /data/2/Params/Hydro/calc_c_h                                          attr
  uint8_t   /data/2/Params/Hydro/calc_dt_hyp                                       attr
  double    /data/2/Params/Hydro/cfl                                               attr
  double    /data/2/Params/Hydro/cfl_diff                                          attr
  double    /data/2/Params/Hydro/dt_diff                                           attr
  uint8_t   /data/2/Params/Hydro/first_order_flux_correct                          attr
  double    /data/2/Params/Hydro/max_dt                                            attr
  int32_t   /data/2/Params/Hydro/nhydro                                            attr
  int32_t   /data/2/Params/Hydro/nscalars                                          attr
  uint8_t   /data/2/Params/Hydro/pack_in_one                                       attr
  int32_t   /data/2/Params/Hydro/scratch_level                                     attr
  double    /data/2/Params/Hydro/turbulence/accel_rms                              attr
  int32_t   /data/2/Params/Hydro/turbulence/inject_n_blobs                         attr
  int32_t   /data/2/Params/Hydro/turbulence/inject_once_at_cycle                   attr
  double    /data/2/Params/Hydro/turbulence/inject_once_at_time                    attr
  uint8_t   /data/2/Params/Hydro/turbulence/inject_once_on_restart                 attr
  double    /data/2/Params/Hydro/turbulence/kpeak                                  attr
  int32_t   /data/2/Params/Hydro/turbulence/rescale_once_at_cycle                  attr
  double    /data/2/Params/Hydro/turbulence/rescale_once_at_time                   attr
  uint8_t   /data/2/Params/Hydro/turbulence/rescale_once_on_restart                attr
  double    /data/2/Params/Hydro/turbulence/rescale_to_rms_Ms                      attr
  uint32_t  /data/2/Params/Hydro/turbulence/rseed                                  attr
  double    /data/2/Params/Hydro/turbulence/sol_weight                             attr
  double    /data/2/Params/Hydro/turbulence/t_corr                                 attr
  uint8_t   /data/2/Params/tracers/enabled                                         attr
  int32_t   /data/2/Params/tracers/n_lookback                                      attr
  double    /data/2/Params/tracers/num_tracers_per_cell                            attr
  int32_t   /data/2/Params/tracers/rng_seed                                        attr
  double    /data/2/Params/tracers/t_lookback                                      attr
  int32_t   /data/2/Refine                                                         attr
  double    /data/2/RootGridDomain                                                 attr
  int32_t   /data/2/RootGridSize                                                   attr
  int32_t   /data/2/RootLevel                                                      attr
  double    /data/2/WallTime                                                       attr
  int32_t   /data/2/derefinement_count                                             attr
  double    /data/2/dt                                                             attr
  int32_t   /data/2/loc.level-gid-lid-cnghost-gflag                                attr
  int64_t   /data/2/loc.lx123                                                      attr
  string    /data/2/meshes/acc_acc_0_lvl0/axisLabels                               attr
  string    /data/2/meshes/acc_acc_0_lvl0/dataOrder                                attr
  string    /data/2/meshes/acc_acc_0_lvl0/geometry                                 attr
...

Any short/long term recommendations?
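
The assumption above (attributes inside nested groups are not listed at the Iteration level) can be pictured with a small, self-contained sketch. It is not the openPMD-api's parsing code: `directAttributes` is a hypothetical helper that only keeps attribute names located directly inside a given group, which reproduces the difference between the two listings.

```cpp
#include <string>
#include <vector>

// From a flat list of full attribute names (as printed by bpls -A), keep
// those that live directly inside the group `groupPrefix` (which must end
// in '/'), and return them without the prefix. Attributes in subgroups,
// i.e. names with a further '/', are skipped.
std::vector<std::string> directAttributes(
    std::vector<std::string> const &allNames,
    std::string const &groupPrefix)
{
    std::vector<std::string> result;
    for (auto const &name : allNames)
    {
        // Skip names outside of the group.
        if (name.compare(0, groupPrefix.size(), groupPrefix) != 0)
        {
            continue;
        }
        std::string const rest = name.substr(groupPrefix.size());
        if (!rest.empty() && rest.find('/') == std::string::npos)
        {
            result.push_back(rest);
        }
    }
    return result;
}
```

Applied to the prefix `/data/2/`, names such as `dt` and `loc.lx123` are kept, whereas `Params/tracers/t_lookback` only shows up when asking for the prefix `/data/2/Params/tracers/`.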

franzpoeschel and others added 27 commits November 7, 2024 10:18
Introduction of iteration["meshes"].asContainerOf<Mesh>() as a more
explicit variant for iteration.meshes.
TODO: Since meshes/particles can no longer be directly addressed with
this, maybe adapt the class hierarchy to disallow mixed groups that
contain meshes, particles, groups and datasets at the same time.

Only maybe though..
They have their own meaning now and are no longer just carefully maintained
for backwards compatibility.
Instead, they are supposed to serve as a shortcut to all openPMD data
found further down the hierarchy.
2 participants